Skip to main content

Full text of "Adaptive Monotone Shrinkage for Regression"

See other formats


arXiv: 1505.01743vl [stat.ME] 7 May 2015 


Adaptive Monotone Shrinkage for Regression 


Zhuang Ma Dean Foster 

Department of Statistics Yahoo Lab 

University of Pennsylvania New York, NY 10018 

Philadelphia, PA 19104 dean@foster.net 

zhuangma@wharton.upenn.edu 


Abstract 

We develop an adaptive monotone shrinkage es¬ 
timator for regression models with the following 
characteristics: i) dense coefficients with small 
but important effects; ii) a priori ordering that in¬ 
dicates the probable predictive importance of the 
features. We capture both properties with an em¬ 
pirical Bayes estimator that shrinks coefficients 
monotonically with respect to their anticipated 
importance. This estimator can be rapidly com¬ 
puted using a version of Pool-Adjacent-Violators 
algorithm. We show that the proposed monotone 
shrinkage approach is competitive with the class 
of all Bayesian estimators that share the prior in¬ 
formation. We further observe that the estima¬ 
tor also minimizes Stein’s unbiased risk estimate. 
Along with our key result that the estimator mim¬ 
ics the oracle Bayes rule under an order assump¬ 
tion, we also prove that the estimator is robust. 

Even without the order assumption, our estima¬ 
tor mimics the best performance of a large family 
of estimators that includes the least squares es¬ 
timator, constant-A ridge estimator, James-Stein 
estimator, etc. All the theoretical results are non- 
asymptotic. Simulation results and data analysis 
from a model for text processing are provided to 
support the theory. 

1 INTRODUCTION 

Feature selection and coefficient estimation are familiar 
topics in both statistics and machine learning communities. 
Many results in this area concern models that are "nearly 
black,’ possessing a handful of large effects against a wide 
field of noise. Consider the widely used linear model 

Y = X0 + e where e ~ N( 0, cr 2 I n ) , (1) 

X is full-rank, n x p matrix of explanatory features with 
p < n, and 0 is a p dimensional vector of unknown coeffi- 


Robert Stine 

Department of Statistics 
University of Pennsylvania 
Philadelphia, PA 19104 
stine@wharton.upenn.edu 


dents. In the "nearly black’ case, all but a few of the coor¬ 
dinates of 0 are zero. A long sequence of results leverage 
this sparsity (Foster and George [1994]; Tibshirani [1996]; 
Abramovich et al. [2006]; Candes and Tao [2007]; Fan and 
Lv [2008]; Bickel et al. [2009]). Sparsity assumptions are 
well suited to many applications, especially within the field 
of signal and image processing (Donoho [1995], Wright 
et al. [2009]). 

Despite the prevalence of research on sparse models, some 
applications do not conform to this paradigm. For example, 
Foster et al. [2013] used methods such as latent semantic 
analysis, essentially principal components analysis (PCA), 
to convert text into features for regression analysis. The 
estimated coefficients of these principal components show 
two specific characteristics that draw our attention: dense 
coefficient estimates with a monotonically decaying effect 
size. Rather than concentrate in a few estimates, the pre¬ 
dictive power of the model spreads across many features. 
Sparsity-based methods such as hard or soft thresholding 
that set small effects to zero produce fitted models with 
greatly diminished predictive ability. Too much predic¬ 
tive signal has been lost by eliminating small, but nonethe¬ 
less informative, coefficients. Dense coefficients appear in 
other applications as well. Hall et al. [2009] and Dicker 
[2011, 2012] also propose models for dense signals and 
Dicker [2011, 2012] discusses several shrinkage estimators 
in high dimensions. 

The second characteristic of this application is the mono¬ 
tone decrease in typical effect size. The signal tends to 
concentrate in the leading principal components, then grad¬ 
ually decay. We may not know the signal strength, but we 
do have an ordering. In this sense, the unsupervised PCA 
of the text data provides useful information that can be ex¬ 
ploited within the regression. In particular, the eigenvalues 
from the PCA provide an external ordering of the features 
that is suggestive of the effect size. Such exogenous infor¬ 
mation appears in other domains. In time series analysis, 
data collected more recently are expected to be more in¬ 
formative for the prediction of future trends. In principal 
components regression, we tend to expect the first principal 


component to be more important than the second. There¬ 
fore, models without incorporating this prior knowledge 
might be suboptimal. 

In this paper, we capture both characteristics with an em¬ 
pirical Bayes estimator that shrinks coefficients monotoni- 
cally with respect to their anticipated importance. The pro¬ 
cedure is tuning free and can be efficiently implemented 
using Pool-Adjacent-Violators algorithm. We further show 
that the estimator can be derived from frequentists’ per¬ 
spective as well by minimizing Stein’s unbiased risk es¬ 
timate. Finally, we establish non-asymptotic results to gau- 
rantee that the proposed estimator is nearly Bayes optimal 
under the order assumption and even when the order as¬ 
sumption (or say, prior knowledge) is wrong, it still mim¬ 
ics the best performance of a large family of estimators that 
includes the least squares estimator, ridge estimator, James- 
Stein estimator, etc. 


I p . Then the Bayes rule p* in (1) and (2) is: 


where p = (pi, ■ ■ ■ , P p ) = X T Y is the least squares esti¬ 
mator. The Bayes rule shrinks pi monotonically, shrinking 
more and more harshly as the index and of increase. For 
our application, we know only the order of the features, not 
the signal strength of, so the Bayes rule is not a real es¬ 
timator because it depends on the unknown parameter of. 
To mimic the performance of the Bayes rule, we estimate 
the of s from data under the order constraint and use the 
resulting plug-in estimator. For convenience, we write the 
model as /? ~ N( 0, E) where E = diag(a\ , • • • , of). 

2.2 Maximum Marginal Likelihood Estimator 


The rest of this paper is organized as follows. In sec¬ 
tion 2, we describe the monotone shrinkage model in de¬ 
tail and introduce the maximum marginal likelihood esti- 
mator(MMLE) and Pool-Adjacent-Violators algorithm. In 
section 3, we show that the proposed estimator also mini¬ 
mizes Stein’s unbiased risk estimate and establish its non- 
asymptotic oracle properties both with and without order 
assumption. In section 4, we suggest an estimator of the er¬ 
ror variance a 2 . In section 5, we present simulation results 
and an analysis of text to support our theory. Concluding 
remarks are given in section 6. Details of the technique are 
provided in the Appendix. 


2 ADAPTIVE MONOTONE SHRINAKGE 

2.1 Model Formulation 

We use a Bayesian framework to encode the prior knowl¬ 
edge about the importance of explanatory features in our 
model. We express this prior knowledge in a distribution 
of the coefficients (3. Intuitively, if the features within the 
regression are standardized such that X T X = I p , then the 
coefficient |/3j| gives the importance of the ith feature: a 
unit change in X t is associated with a change of |/3j| in 
the response. A natural prior that captures the sense that 
the elements of p have decaying size specifies a monotone 
decreasing sequence of variances for the coefficients. For 
convenience, we assume that the size of signal in pi is de¬ 
creasing with the index i: 


Pi ~ N(0, af) 

of > of > • • • > a 2 p > 0 

Since we know the order, we can always rearrange Pi so 
that the unknown of are monotone as above. Throughout, 
we consider only orthonormal designs for which X T X = 


To estimate the ordered prior variances on the diagonal 
of E, we observe that the marginal distribution of Y is 
N( 0, XYjX t + cr 2 I n ). If we further assume the error vari¬ 
ance a 2 from (1) is known (or we can plug in a consistent 
estimator), a simple calculation shows that the least square 
estimator p = X T Y is a sufficient statistic for E. So, in the 
following discussions, we base our inference on p, whose 
marginal distribution is N(0,cr 2 I p + E). A natural esti¬ 
mator of E is the maximum marginal likelihood estimator 
(MMLE). The log marginal likelihood function is 

((E) = ^log(27r) + logO 2 + of) + j 


Consequently, the MMLE is the solution to the following 
optimization problem: 


arg min yf log(o- 2 + of) + 

*=i V 


(T 2 + a 2 


(3) 


subject to v 2 i>v%> ■■■><?;> o 


2.3 Pool-Adjacent-Violators Algorithm 

The optimization problem (3) resembles the well-known 
isotonic regression problem 

n 

p lso = arg min yf (y t — Pi) 2 subject to P i > • • • > p n 

■ i= 1 

(4) 

whose unique solution can be efficiently obtained by 
running the Pool-Adjacent-Violators (PAV) algorithm. 
Roughly speaking, this algorithm solves (4) as follows. Set 
i = 1. Move to the right (increase the index i) until finding 
a pair (?/,:, y t +i) that violates the monotonicity constraint, 
that is yi < Vi+i- Pool y, : and the adjacent Di+i and replace 
both by their average. Next check whether r/j_i < ?;,+ f ,+1 . 
If so, replace (j/*-i. y.j, y i+ 1 ) with their average. Continue 



to the left until monotonicity is satisfied and then proceed 
to the right until the whole sequence is monotone. Hence, 
PAV algorithm outputs an decreasing blockwise constant 
sequence. As far as we know, the PAV algorithm dates back 
to Ayer et al. [1955], where it is used to compute the MLE 
of independent binomial distributions. Brunk [1955, 1958] 
considered rather general scenarios and established some 
consistency properties. According to Grotzinger and Witz- 
gall [1984], if carefully implemented, the PAV algorithm 
has computational complexity 0 (n). 

Although the optimization problem (3) is not convex, it can 
be solved efficiently by the PAV algorithm. Before estab¬ 
lishing this result, we introduce some notations. 

fi(x) = log (a; + cr 2 ) + ^ 

X + <7 Z 

of = arg min/j(a;) = P? - o 2 

X 

Proposition 2.1. The following two-step algorithm pro¬ 
duces the MMLE denoted by (a 2 , • • • , a 2 ) 

Step!, (of, ■ ■ ■ , a 2 ) = PAV(of,■■■ ,<r 2 ) 

Step 2. a 2 = dfl {6 |> 0) . 

For those of estimated to be 0, it means the corresponding 
features are not included in the model. To prove the 
proposition, we first introduce a lemma: 

Lemma 2.1. Consider optimization problem 

p 

min ^/i(0i) 

subject to: 9\ > 62 > • • • > 9 V 

where the elementwise solution 61 = arg mine fi(9) is fi¬ 
nite. If {fi, ■ ■ ■ , f p } satisfies the pooling condtion defined 
below, then the optimization problem has a unique solution 

(0i,- J P )=PAV(e ir -- ,e p >. 

Definition 1 (Pooling Condition). For a sequence of func¬ 
tions {fi,--- ,/p} with 9i = arg min 0 fi(9) being finite. 
Let 6 i:j = J2l=i &k/(j ~i + !)• We say (A, - - - , f p } sat¬ 
isfies pooling condition if Mi < j, argmin fk(fi) = 

9ij and fk{@) is strictly decreasing when 9 < 9ij 
and strictly increasing when 9 > 9ij. 


(of,-- - ,of) = PAV ( of, of) solves: 

min ^(log(a 2 + of) + a 2^ a2 ) 
subject to: of > of > ■■■ > of 

which is only slightly different from our original optimiza¬ 
tion problem. To finish the proof, we just need to introduce 
some auxiliary functions. Let /_*, = log (o 2 + <x/ k ) + 
,1 < k < n, so d 2 _ k = argmin f- k (o 2 _ k ) = 0. 

' — k 

Consider the following optimization problem (Q n ): 



Lemma 1 shows PAV(<x 2 , • • • , of, 0, • • • ,0) solves ( Q n ). 
Denote it (of^ • • • , o 2 npl o 2 _ nn , ■■■ , o 2 nl ). Notice that 
(of, ■ ■ ■ , of, 0, • • • , 0) is a feasible solution, which should 
be suboptimal, that is to say, 


^(log(cr 2 + &f 2i ) H- 2 V -2 ) + E 


< £(]og(o* + + ff 2^ 2 ) + E 


Recall that df_ k = argmin f-k(cr 2 _ k ) = 0, which implies 


E(!0g(0- J + O-fi) + 2^2 ) 

i= 1 m 

< J2(\og(o 2 + of) + J' 2 ) 

i= 1 ' l 

Let n goes to infinity, PAV(ct 2 ,--- ,df, 0, • • • ,0) con¬ 
verges to (dfl {d . 2 > 0) , ■ • • , ct 2 / ( ^ 2 > 0) , 0, • • • , 0) and the in¬ 
equality above implies (offi^y^,--- ,ofl^ 2 >o)) is the 
solution to the original optimization problem. □ 


The pooling condition can be easily checked and numerous 
distributions have log likelihood functions that satisfy the 
condition. These include the binomial distribution, Pois¬ 
son distribution, normal distribution with fixed variance 
and variable mean, normal distribution with fixed mean and 
variable variance, and so on. 

Proof of Proposition 2.1 

The situation we are faced up with is normal distribu¬ 
tion with fixed mean and variable variance, which satis¬ 
fies the conditions in Lemma 1. According to the lemma. 


2.4 Data-Driven Blockwise James-Stein Estimator: 

Global and Local Adaptivity 

Proposition 1 and the nature of Pool-Adjacent-Violators 
algorithm show that the MMLE (if 2 , • • • , <f 2 ) is decreas¬ 
ing and blockwise constant. We now change notation in 
this part and let of denote the common variance estimate 
of the zth block. Define /% = (fin , Pi ni ) to be the 
coefficients within the fth block and correspondingly de¬ 
fine @i, j!li. With these notations, we can write explicitly 



Y^LAPij-° 


and 


Pi 


of + o 2 


(i- 


S"ii P?i 


) + Pi i 


which is exactly the positive part of the James-Stein type 
estimator. Hence the proposed estimator can be interpreted 
as a monotone blockwise James-Stein estimator. Block- 
wise James-Stein estimator is well-studied in the wavelet 
setting (Cai [1999], Cai and Zhou [2009]). In Cai [1999], 
the block size is fixed before observing the data and is the 
same for all blocks. Cai and Zhou [2009] proposed an 
adaptive procedure to make the block size data-driven but 
the block size remains the same for all blocks. As for our 
procedure, the number of blocks and the size of each block 
are completely data-driven. The difference is due to dif¬ 
ferent assumptions. The former is based on smoothness of 
Besov bodies while the later is based on monotonicity. 

The advantage of our data-driven, monotone blockwise 
James-Stein estimator is the ability to achieve both global 
and local adaptivity. Blockwise shrinkage utilizes informa¬ 
tion about neighboring coefficients. However, if the block 
size is too large, local inhomogeneity might be overlooked. 
So, the best way to achieve a good balance is to let the data 
speak for itself. 


3 ORACLE RISK PROPERTIES 


3.1 Equivalence between MMLE and SURE 
Estimator 


In this section, we show that our empirical Bayes estima¬ 
tor can also be derived within a frequentist framework by 
minimizing Stein’s unbiased risk estimate (SURE). Under 
squared error loss: 10, p) = \Y7i=i0i ~ Pi) 2 ■ If one 
uses the shrinkage estimator /3 A defined by p x = x .+ a2 Pi 
to estimate fa, the risk for a given /3 is: 

R P 0\p) = E[10\P) I A?) 

P ' A i) 


and an unbiased estimate for the risk is 


SURE{ A) 


1 p 


2 32 


i—1 


o 2 + A , 


VP: 


a 2 (Xi - cr 2 ) 
a 2 + A i 


the monotone shrinkage parameter A = (Ai, • • • , A p ). If 
only considering the family of monotone shrinkage esti¬ 
mators, the relationship suggests that the data-dependent A 
which minimizes SURE(A) should be a good choice. De¬ 
fine: 


Xsure = arg 


min 

Ai>--->A p >0 



2 Pi- 


a 2 (X j-a 2 ) 

G 2 + A; 


which is of the same form as optimization problem (3). Let 
<7i(A») = {0^x~) 2 P 2 + g 1 • Then it is easy to see 
that A i = argmingj(Aj) = 0 2 — a 2 . Checking that gi(X t ) 

satisfy the two conditions in Lemma 2.1, the same argu¬ 
ment used to show Proposition 2.1 implies: 

Proposition 3.1. MMLE equals SURE estimator P^sure. 


In the rest of the paper, we will use Psure = p XsuRE to 
denote the proposed estimator. 

Remark 3.1. Monotone shrinkage estimator was also in¬ 
vestigated in Xie et al. [2012] when dealing with het- 
eroscedastic normal sequence model. Different empirical 
Bayes estimators were studied in this paper and SURE es¬ 
timator was shown to dominate MMLE and method of mo¬ 
ments. While in our context, the three estimators turned out 
to be the same. 


3.2 Oracle Property with Order Assumption 


Proposition 3.1 provides us with a powerful tool to inves¬ 
tigate the risk properties of the proposed estimate. First of 
all, we introduce the oracle estimator, namely the Bayes 
rule (3* = {Pi,-- - ,Pp) defined by 


p: = 


of + o ■ 


;Pi 


Of course, /?* is not an practical estimator because it de¬ 
pends on the unknown parameter A* = (of, ■ ■ ■ ,cr 2 ). It is 
easy to see the oracle risk is: 


i P 2 2 

p (T 2 + of 
1=1 1 


Then we introduce another lemma, which is the building 
block of the oracle properties. It says that E[SURE{\)] 
is uniformly good approximation of the true risk E[l0fa)], 
where the expectation is with respect to both data and pa¬ 
rameter fa 

Lemma 3.1. 


Generally, SURE(A) is unbiased estimate of the risk only if 
A is a fixed constant and cannot depend on data. We say p x 
is a monotone shrinkage estimator i f ft = j^ffPi and 

Ai > • • • > A p > 0, where A , can be data dependent. A 
monotone shrinkage estimator is completely determined by 


sup sup \E{E[10 X ,P)-SURE{X)\P]}\ < L, 2 o 2 

Xl>-->X P » P 

where A = (Ai, • • • , \ v ) is arbitrary monotone shrinkage 
parameter and can be data dependent. 



Theorem 3.1. 


Take expectations, we have 


sup (. R0sure ) - R(P*)) < 4< f-o 2 

Vl>-><7%>0 V P 

Proof: Because A* is fixed constant, we have 

R0SURE , P) - R(P*, p) 

= E[10sure,P)\P) - E[SURE(X*)\P\ 

= E[10sure,P) ~ SURE(X S ure)+ 
SURECXsure) ~ SURE(X*)\P] 

< E[1(Psure,P ) - SURE(Xsure)\P] 

The inequality is due to the definition of A sure- So the 
Bayes risk satisfies: 

R{Psure ) - R{P*) 

< E{E[1(Psure, P) ~ SURE(Xsure)\P ]} 

< sup sup \E{E[l(p % ,p)-SURE(X)\p}}\ 

of- Ai>-->Ap>0 


R{Psure ) - R(fa) < E{E[l(P, Psure)- 
SURE(Xsure) + SURE(fj) - l(P,fa)\P}} 

<2 sup sup \E{E[l(P~ x ,P)-SURE{X)\p]}\ 

of — ,ol Ai>--->Ap>0 

Lemma 3.1 implies, 

R{Psure ) — inf R(Py) < 8i j-o 2 + e 

7l>->7p>0 V P 

Since the upper bound does not depend on erf, let e —* 0, 
and the theorem follows. □ 

If we replace the data dependent shrinkage parameters in 
Theorem 2 with fixed ones, we can improve the error bound 
by a factor of 2, which is 

Corollary 3.2. 

sup (R(Psure) ~ inf -R(/3 7 )) < 4-i/—cr 2 
ct 2 ... j0 .2 7l>-">7p>0 V P 


Applying lemma 3.1 finishes the proof. □ 

Remark 3.2. Theorem 3.1 shows that SURE estimator 
mimics the oracle Bayes rule and therefore outperforms all 
other estimators. What needs to be highlighted is that this 
is a non-asymptotic result with rate of convergence 0 (p~ 2 ) 
independent of the true erfs. No matter how the erfs vary, 
as long as the order is known, the proposed adaptive pro¬ 
cedure can uniformly capture the truth. 

Corollary 3.1. sup = 1 + 4,/| 

of>-><?l> 0 V 

3.3 Oracle Property without Order Assumption 

In this section, we show that even without knowing the or¬ 
der of the erf’s, the proposed estimator retains an oracle 
property among monotone shrinkage estimators. 

Theorem 3.2. 

sup ( R0SURE ) - inf R(P^)) < 8 \ -a 2 

0-2,... i(T 2 7l>'">7p>0 VP 

Proof: For any given (of, ■ ■ ■ ,of), we can always find 
f) = (f/i, • • • ,r) p ) that satisfies fji > • • • > fj p > 0 and 
R(Pfj) < inf R(Py) + £• Then, 

7!>->7p>0 

R(Psure)— inf R(Py) < R(Psure) — R{Ptj)+e 

7l >...>7 P >o 

Notice that, 

l(P, Psure)~ 1(P, Pfj) = (1 (P,Psure)~SURE(Xsure 

+ (SURE(XsuRE)-SURE(p))+(SURE(fi)-l(p,pr ) )) 

< ( l(P , PsuRE)-SURECXsuR E ))+(SURE(fj)-l(P, fa) 


Theorem 2 shows that even when the order assumption is 
invalid, the proposed estimator is nearly the best in the 
family of monotone shrinkage estimators. In particular, 
uniform shrinkage estimators such as least square estima¬ 
tor, ridge estimator and James-Stein estimator and step¬ 
wise regression methods such as monotone AIC, BIC, RIC 
(just search for p nested submodels: with ith submodel as 
{1, • • • , *}) are included. This is also a non-asymptotic re¬ 
sult with rate of convergence 0 (p~ 2 ) independent of the 
erfs. Therefore, the proposed estimator is robust and good 
enough for practical use. Actually, Theorem 3.2 states 
about the worst case. If the order is partially right, the 
proposed procedure benefits where the order is right and 
retains good properties where the order is wrong. 

Remark 3.3. The robustness is due to the ‘soft constraint ’. 
Instead of restricting the norm of regression coefficients to 
be monotone, we incorporate the constraint in the prior 
distribution, which makes the model flexible and robust. 

4 ESTIMATION OF a 2 

We have assumed o 2 is known to establish the theoretical 
properties of our estimator. Here we suggest a reasonable 
estimate in practice that is based on maximum marginal 
likelihood. Unlike section 2.2, within this section the un¬ 
known parameter becomes 6 = (of, ■ ■ ■ ,of,o 2 ). Recall 
that the marginal distribution of Y is N( 0, X'EX 1 '+o 2 I n ). 
So, the log marginal likelihood function: 

l(0\y) oc -log(\XXX T +o 2 I n \)-y T (XXX T +o 2 I n )- 1 y 

where j • | means determinant. Let X = (xi,--- ,x p ), 
we can add another n — p vectors x p +i, • • • , x n to make 



X = (A", ajp+i, ■ ■ ■ , x n ) an orthonormal matrix. Let E = 
diag{Y, + a 2 I p ,cr 2 I n - p ). Then AEX T + a 2 I n = XtX T 
and thus (AEA T + cr 2 In) -1 = AE _1 X T . Plug this ex¬ 
pression back into the marginal likelihood function, 

l{e\y) (X -log(\XtX T \) - y T Xt~ 1 X T y 

We abuse notation in this section and let f} = X T y. If 
we introduce variable (rf,, r 2 ) = diag(T,) = (cr 2 + 
a 2 , - ■ ■ , cr 2 + cr 2 , (7 2 , • • • , a 2 ), then 

K s ) « -^(iogr 2 + 

»=i * 

So, the MMLE is the solution to the following optimization 
problem. 


min^^ogr 2 + 

i= 1 ^ 

subject to: if > • ■ ■ > r 2 > r 2 +1 = ■ • • = r 2 > 0 

Following a similar but slightly different argument in 
Proposition 2.1, we obtain the following result and omit 
the proof. 

Proposition 4.1. I'he solution is uniquely given by 


(a j 11 1 ) E? 1 A, 

PAV0 2 ,- 


B 2 

5 rp 5 


n — p 


n — p 


77ze MMLE of original parameters can be recovered 

by CaV" ,r 2 f 2 +1 ,--- ,f 2 ) = (<r? + d 2 ,-- 

* 2 ,*V-- ,<r*). 


. + 


5 NUMERICAL EXPERIMENTS 

5.1 Simulation Results 

In this section, we compare the proposed monotone shrink¬ 
age approach with several other popular methods for fea¬ 
ture selection and estimation. For simplicity, we only con¬ 
sider the normal sequence model and assume the error vari¬ 
ance cr 2 is known. 

• PAV, the proposed adaptive monotone shrinkage pro¬ 
cedure computed by Pool-Adjacent-Violators algo¬ 
rithm. 

• Lasso, with A selected by minimizing Stein’s unbi¬ 
ased risk estimate. Under orthogonal design, it is also 
known as Sureshrink (Donoho and Johnstone [1995]). 

• Ridge estimator with A selected by Cross-Validation. 

• Positive part of James-Stein estimator 

• Classical stepwise regression, we use AIC for penalty 
criterion. 


• Monotone AIC: AIC that just searches for p nested 
submodels, i.e., with kth submodel={l, • • • , fc} 

We consider the following scenarios (p = 100, cr 2 = 1): 

1. Signals with Decaying Size: (cr 2 , • • • , cr 2 ) are gener¬ 
ated from decreasing order statistics of 2x 2 . 

2. Signals with Same Size: a 2 = 2, VI < i < p 

3. Sparse Signals: first 90% of the cr? are 0 and remain¬ 
ing 10% of cr 2 are generated from decreasing order 
statistics of 4\ 2 . 

4. Signals with Increasing Size: (cr 2 , • • • , a 2 ) are gener¬ 
ated from increasing order statistics of 2\- 2 . This sce¬ 
nario dose not satisfy our order assumption(actually, 
the worst case), which is used to show the robustness 
of our procedure. 

With (cr 2 , • • • , cr 2 ) fixed, we adopt the following simula¬ 
tion strategy. 

L Generate /? == (/3 X , • • • , /3 P ) by Bi ~ N( 0, cr 2 ) 

2. Condition on /3, generate the observation X = 
(xi , * * * ,x p )by Xi ~ N{Bi,a 2 ) 

3. Use the methods discussed above to estimate the sig¬ 
nal B and compute the mean square error. 

4. Repeat 1-3 for 400 times. The average of the mean 
square errors is an estimate of the Bayes risk. 

Mean square error for different /? are given below using box 
plot and the middle line of each box represents Bayes risk 
of each procedure. The red line in the figure stands for the 
oracle risk, i.e., the Bayes risk of the oracle Bayes rule. 



PAV James-Stein Ridge Lasso Montone AIC AIC 


Figure 1: Signals with decaying size, i.e. {a 2 } is decreasing. The 
adaptive monotone shrinkage procedure pools signals of similar 
sizes together and shrinks blockwisely and monotonically. 


For signals with decaying size, oracle estimator shrink 
monotonically with respect to the size of the signals. Uni¬ 
form shrinkage estimators such as ridge and James-Stein 
estimator are suboptimal. The proposed adaptive monotone 
procedure makes use of the prior information and mimics 
the oracle Bayes mle by pooling signals of similar size to¬ 
gether so that it shrinks blockwisely and monotonically. 
AIC overfits the data while monotone AIC makes use of 
the order structure and therefore performs better. 


































PAV James-Stein Ridge Lasso Montone AIC AIC 


Figure 2: Signals with same size, i.e. of s are the same. This is 
the case where ridge and James-Stein estimator capture the truth 
with full power while our procedure will regard the of as differ- 
ent(decreasing) and will generally divide the of s into more than 
one blocks, which leads to slight power loss. 

For signals with same size, the oracle estimator shrink uni¬ 
formly. Ridge and James-Stein estimator mimic the oracle 
Bayes rule with full power. The proposed adaptive proce¬ 
dure does not necessarily gaurantee uniform shrinkage but 
the power loss is negligible. 



PAV James-Stein Ridge Lasso Montone AIC AIC 


Figure 3: Sparse Signals, i.e. the size of the signals remain 
decreasing while 90% of them are 0. The proposed monotone 
shrinkage procedure can effectively kill the noise and shrink the 
signals properly. 

For sparse signals, the oracle estimator kill the noise and 
shrink the signals monotonically. Monotone AIC can effi¬ 
ciently distinguish signal and noise while does not shrink 
the signals. The proposed adaptive procedure not only kills 
the noise but also shrinks the real signals properly accord¬ 
ing to their sizes. For those methods that cannot make use 
of the order structure. Lasso does better in this sparse case. 



-n-1-n-1-1-1— 

PAV James-Stein Ridge Lasso Montone AIC AIC 


Figure 4: Signals with increasing size, i.e. {erf} is increasing, 
which is opposite to our assumption that the size of signals is 
decaying. The adaptive monotone shrinkage procedure is robust 
and performs as good as other estimators. 


For signals with increasing size, the proposed estimator 
uses completely reverse order. As theorem 2 expects, 
wrong prior knowledge won’t min our estimator. It still 
mimics the best performance of the monotone shrinkage 
family. However, monotone AIC, which is not as robust as 
our procedure, suffers a lot from wrong prior knowledge. 

5.2 Analysis of Text Processing Data 

In this section, we apply the proposed adaptive monotone 
shrinkage approach to text data of real estate described in 
Foster et al. [2013]. The features included in the regres¬ 
sion model are the leading 1500 principal components of 
the bag-of-words of text. The response is the log transfor¬ 
mation of the real estate price. We use the eigenvalues from 
PCA to order the effect size of the features (see Figure 5 for 
the absolute t-statistics of the leading 500 principal compo¬ 
nents). Although the data dose not ideally satisfy the as¬ 
sumptions of our model, the proposed adaptive procedure 
is robust enough to leverage this rough prior knowledge. 


Absolute t-statistics of the leading 500 Principal Components 



Figure 5: Absolute t-statistics of the leading 500 Principal Com¬ 
ponents. Those above the red line are significant. 

The sample size is 7384 and we use 10 fold cross validation 
to estimate the prediction error of each procedure. 

Prediction Error by 10 fold Cross Validation 



Figure 6: Prediction error comparison of different methods. Las- 
soSURE: Lasso with tunning paramter selected by minimizing 
Stein’s unbiased risk estimate. LassoCV: Lasso computed by 
LARS (Efron et al. [2004]) and paramter tuned by cross valida¬ 
tion. 













































































The result shows that 

• PAV outperforms Ridge regression. From Figure 5, 
we can see that the signals are of different sizes. Uni¬ 
form shrinkage method shrink the important features 
too much while shrink weak signals less harshly than 
it should be. PAV can adaptively pool the signals 
of similar size together and shrink blockwisely and 
monotonically. 

• PAV outperforms LassoSure and LassoCV. Lasso can 
capture the sparse pattern of the data but as sacrifice, 
it might shrink important features a bit more than they 
should be. 

• PAV outperforms Monotone AIC. Both procedures 
make use of prior information but PAV is more robust. 
There are several informative principal components 
corresponding to small eigenvalues so that Monotone 
AIC will exclude them from the model. 

6 CONCLUSIONS 

In this paper, we proposed an adaptive monotone shrink¬ 
age approach for regression with features of ordered effect 
size. We showed that the procedure can be rapidly com¬ 
puted via Pool-Adjacent-Violators algorithm and holds ora¬ 
cle risk properties. Non-asymptotic results are established. 
Furthermore, although the procedure is based on knowing 
the right prior knowledge about the features, we proved 
that, when the prior knowledge is wrong or in the absence 
of prior knowledge, the estimator still mimics the best per¬ 
formance of the family of monotone shrinkage estimators. 
Hence, it is robust enough to use in practice. 

Compared with penalized least square methods which re¬ 
quire heavy computational effort to find the best regular¬ 
ization paramter, the proposed adaptive procedure is tuning 
free. As noticed in the analysis of text data, the monotone 
shrinkage approach naturally works with PCA since the 
principal components are essentially ordered and orthogo¬ 
nal. Recent developments in randomized algorithms(Halko 
et al. [2011]) enable us to quickly compute the PCA of a 
huge matrix so that the proposed procedure can be easily 
applied to large-scale datasets. 

7 APPENDIX 

7.1 Proof of Lemma 2.1 

It is sufficient to prove the following two claims: 

i) (01, • • • , ^)=PAV(0i, • • • , 0 k ), l^k<p 

ii) VI <i,j < k, Of = Of => 9 f = 0f,Mk < m < p 

We prove the claim by induction: 

1. It is trivial for k = 1 since 0\ = 0\ 


2. Assuming the claim holds for k and we are going to 
prove it is valid for k + 1 as well. 

Case 1 9 k+ 1 < Of. It implies that 


E/*(<??)+/*+ 1 &+ 1 ) 


i=1 


= n rnir U fi( e i ) + min/fc+i(0fe+i) 

>k —" Vk + 1 

Z = 1 


k +1 

< min fi(0i 

0i>->0fc+i-4v V 

1=1 


and therefore. 


//i/c+1 r\k+ 1\ _ (nk r\k 7\ \ 

lyi — (yi>*" 5 

= PAv(o ir -- ,e k ) 


So we prove claim i). Notice that 0fX\ f 0f +1 , Vj < k, 
claim ii) is true by induction. 

Case 2 0 k +\ > Of. 

Denote j the smallest integer such that Of = 0f +1 = 
• • • = Of. Because the boundary is not active between 
0j -1 and 0j, by the pooling condition, we can conclude 
that (0f,0f +1 ,--- ,0f) = arg min Zf=j M°i ) and 

U j V k 

of = ■■■ = Of = Zlj Oi/(k - j + 1) < 0 k+ 1 . We 
claim: 0™ = 6j+ 1 = • • • = 9™ + i,Vk < m < p. By induc¬ 
tion we have already known that O'J 1 = 0f x +l = • • • = 0™. 
If < 0 k +i, then by pooling condition, f k+1 (0 k+ i) is 
strictly decreasing on (0,0]?), which forces 0™ +1 = Of 1 . 
If Of 1 > 4+i, then 0Jf +1 > 0 k+1 since f k +i(0) is uni- 
modal and achieves minimum at 0 k +i- Because 0 k+ 1 > 
Zi=j 0i/{k — j + 1), again the pooling condition forces 
Of = 0j\ 1 = --- = Of = Ztj 0i/(k-j + 1) and hence 
Of = 0f + i = ■ ■ ■ = Of = 0f + 1 - Therefore we proved 
Of =0f +1 = ■■■ = 0f +1 ,Vm > k. Specifically, §f +1 = 
oftl = ■■■ = ofll If Z k i=f 0i/(k - j) < 0f_ } , still by 
pooling condition, 0f +l = • • • = 0f+\ = Zf=j 0i/(k-j) 
and consequently 0f +1 — Of, 1 < i < j — 1. We are done 
because the solution is exactly PAV(#i, • • • , 0 k +i), which 
proves claim i) and Of = 0f +1 = ■ ■ ■ = 9f +l .\/m > k 

implies claim ii). If ff - > Of-i, assume i to be the 

smallest integer such that Of = 0f +1 = ■ ■ ■ = 0f_ v By 
similar argument, we can prove that Of = 0f_ 1 = ■ • ■ = 

0fc!|_i,Vm > k. If ZtZi 0t/{k — i) < 0f_ i, we are done. 
If not, continue the same argument. □ 



7.2 Proof of Lemma 3.1 

Plug in the expression of SURE(A), we have 


E[1(P X ,P)-SURE(\)\0] 




2A; 


A 


(ft 2 - Pifc - a 2 ) 


-0f-a 2 -p 2 m = e[- y 

P i=1 <T + A i 

Take expectation with respect to /?, we get 

E{E[l(p x ,p)-SURE(Xm} = 

E{E[~ Y ^V(& 2 - AA - Olft} 


P + A* 


Notice that ft|ft ~ ) and the marginal 

distribution of ff is N(0, a 2 + of), we change the order of 
expectation and get: 


Let Mj = Zi =1 ( z i ~ 1)’ then Mj is a martingale. So the 
L 2 maximal inequality implies: 

E( max Mf) < 4 E(M 2 ) = 8 p 
i <j<p 


\E{E[l(P^,P) — SURE(\)\P}}\ 

<2a 2 f?{maX p |i^(Z l -l)|} 

^ 7 (E( i<S Mi)2) ' 

Combine the two inequalities, we have 

I E{E[l(P x ,P) - SURE(X)\p]}\ < 4^a 2 

Since the error bound does not depend on A and of, the 
lemma follows. □ 


E{E[l{P k ,P)-SURE( A)|ft} = 


Pi-A\ 


‘P^cr 2 + \i'<? 2 + <?t 


where the expectation is with respect to ft ~ N(0, a 2 + 


\E{E[l(P x ,P)-SURE(X)\P]}\ < 




P^[a 2 + \i 

<2 o 2 E{ sup *|Y>( ^ 2 - 1)|} 

l>ci>->c p >0 P ° 2 + °i 

i p 

= 2 cr 2 E{ sup -| Y c d z i ~ 1)1} 

l>ci>-”>c p >0 P 


where Zi ~ i.i.dx 2 ■ Observe that 


sup 


i>(z,. 01 


l>ci>-">c p >0 P i=1 

1 j 

= max I - V (Zi — 1)1 
1=1 

which is also used in Lemma 7.2 of Li [1985] and Theorem 
3.1 in Xie et al. [2012], we have: 


\E{E[1(P‘ X ,P)-SURE(X)\P]}\ 


< 2o 2 E{ max |-^(Z i -l)|) 
1 <]<P P 

1=1 


References 

Felix Abramovich, Yoav Benjamini, David L Donoho, and 
Iain M Johnstone. Adapting to unknown sparsity by con¬ 
trolling the false discovery rate. The Annals of Statistics, 
34(2):584-653, 2006. 

Miriam Ayer, H Daniel Brunk, George M Ewing, WT Reid, 
and Edward Silverman. An empirical distribution func¬ 
tion for sampling with incomplete information. The An¬ 
nals of Mathematical Statistics, pages 641-647, 1955. 

Peter J Bickel, Ya’acov Ritov, and Alexandre B Tsybakov. 
Simultaneous analysis of lasso and dantzig selector. The 
Annals of Statistics, pages 1705-1732, 2009. 

H. D. Brunk. Maximum likelihood estimates of monotone 
parameters. The Annals of Mathematical Statistics, 26 
(4):607-616, 1955. 

H. D. Brunk. On the estimation of parameters restricted 
by inequalities. The Annals of Mathematical Statistics, 
pages 437-454, 1958. 

T Tony Cai. Adaptive wavelet estimation: a block thresh¬ 
olding and oracle inequality approach. Annals of statis¬ 
tics, pages 898-924, 1999. 

T Tony Cai and Harrison H Zhou. A data-driven block 
thresholding approach to wavelet estimation. The Annals 
of Statistics, pages 569-595, 2009. 

Emmanuel Candes and Terence Tao. The dantzig selector: 
statistical estimation when p is much larger than n. The 
Annals of Statistics, pages 2313-2351, 2007. 

Lee Dicker. Dense signals, linear estimators, and out-of- 
sample prediction for high-dimensional linear models. 
arXiv preprint arXiv: 1102.2952, 2011. 



Lee Dicker. Optimal estimation and prediction for dense 
signals in high-dimensional linear models. arXiv 
preprint arXiv:1203.4572, 2012. 

David L Donoho. De-noising by soft-thresholding. Infor¬ 
mation Theory, IEEE Transactions on, 41(3):613—627, 
1995. 

David L Donoho and Iain M Johnstone. Adapting to 
unknown smoothness via wavelet shrinkage. Journal 
of the american statistical association, 90(432): 1200- 
1224, 1995. 

Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tib- 
shirani, et al. Least angle regression. The Annals of 
statistics, 32(2):407^199, 2004. 

Jianqing Fan and Jinchi Lv. Sure independence screening 
for ultrahigh dimensional feature space. Journal of the 
Royal Statistical Society: Series B (Statistical Methodol¬ 
ogy), 70(5):849-911,2008. 

Dean P Foster and Edward I George. The risk inflation cri¬ 
terion for multiple regression. The Annals of Statistics, 
pages 1947-1975, 1994. 

Dean P Foster, Mark Liberman, and Robert A Stine. Fea- 
turizing text: Converting text into predictors for regres¬ 
sion analysis. 2013. 

SJ Grotzinger and C Witzgall. Projections onto order sim- 
plexes. Applied mathematics and optimization, 12(1): 
247-270, 1984. 

Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. 
Finding stmcture with randomness: Probabilistic algo¬ 
rithms for constructing approximate matrix decomposi¬ 
tions. SIAM review, 53(2):217-288, 2011. 

Peter Hall, Jiashun Jin, and Hugh Miller. Feature selection 
when there are many influential features. arXiv preprint 
arXiv:0911.4076, 2009. 

Ker-Chau Li. From stein’s unbiased risk estimates to the 
method of generalized cross validation. The Annals of 
Statistics, pages 1352-1377, 1985. 

Robert Tibshirani. Regression shrinkage and selection via 
the lasso. Journal of the Royal Statistical Society. Series 
B (Methodological), pages 267-288, 1996. 

John Wright, Allen Y Yang, Arvind Ganesh, Shankar S 
Sastry, and Yi Ma. Robust face recognition via sparse 
representation. Pattern Analysis and Machine Intelli¬ 
gence, IEEE Transactions on, 31(2):210—227, 2009. 

Xianchao Xie, SC Kou, and Lawrence D Brown. Sure esti¬ 
mates for a heteroscedastic hierarchical model. Journal 
of the American Statistical Association, 107(500): 1465- 
1479, 2012.